1 Introduction

At least since Priestley’s 1765 Chart of Biography [1], large numbers of individual person records
have been used to illustrate aggregate patterns of cultural history. Wikidata [5], the structured
database sister of Wikipedia, currently contains about 2.7 million explicit person records across
all language versions of the encyclopedia. These individuals, notable according to Wikipedia
editing criteria, are connected via millions of hyperlinks between their respective Wikipedia
articles. This situation gives us the chance to go beyond illustrating an idiosyncratic
subset of individuals, as in the case of Priestley.

In this work we summarize the overlap of nationalities and occupations, based on their
co-occurrence in Wikidata individuals. We construct networks of co-occurring nationalities and
occupations, provide insights into their respective community structure, and apply the results
to select and color chronologically structured subsets of a large network of individuals
connected by Wikipedia hyperlinks. While the imagined communities [3] of nationality are much
more discrete in terms of co-occurrence than occupations, our quantifications reveal the existing
overlap of nationality to be much less clear-cut than in the case of occupational domains. Our
work contributes to a growing body of research using biographies of notable persons to analyze
cultural processes [I]-[9].

2 Method 

In our processing pipeline (cf. Figure 1), we use the Wikidata Toolkit [10] to extract 2.7 million
records about humans (instances of class Q5), in the form of person - property - value triples,
from a downloaded Wikidata JSON dump (09/02/2015). We focus on the properties country of
citizenship (P27) and occupation (P106) (for numbers see Table 1A), restricting our analysis to
nationalities with at least 10 and occupations with at least 100 occurrences. We construct and project
the bipartite person-value affiliation matrices to unipartite matrices of value co-occurrence. To
identify relevant co-occurrences of nationalities or occupations, respectively, the projected
matrices are compared against a null model. Applying an established approach [11], [12], we derive
expected co-occurrence weights from an ensemble of 10,000 degree-preserving random affiliation
matrices. Co-occurrences with positive Pearson residuals are considered for further analysis
(for numbers see Table 1B).
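
The projection and null-model comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the reported numbers: the function names are hypothetical, the randomization is a simple edge-swap chain, the residual is taken to be the common form (observed - expected) / sqrt(expected), and the ensemble is far smaller than the 10,000 matrices mentioned in the text.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence(edges):
    """Project (person, value) edges onto value-value co-occurrence counts."""
    by_person = {}
    for person, value in edges:
        by_person.setdefault(person, set()).add(value)
    counts = Counter()
    for values in by_person.values():
        for a, b in combinations(sorted(values), 2):
            counts[(a, b)] += 1
    return counts

def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization (assumed scheme): exchange the values
    of two random edges unless that would create a duplicate edge."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (p1, v1), (p2, v2) = edges[i], edges[j]
        if p1 == p2 or v1 == v2:
            continue
        if (p1, v2) in present or (p2, v1) in present:
            continue
        present -= {(p1, v1), (p2, v2)}
        present |= {(p1, v2), (p2, v1)}
        edges[i], edges[j] = (p1, v2), (p2, v1)
    return edges

def pearson_residuals(edges, n_random=200, swaps_per_edge=10, seed=0):
    """Residual (observed - expected) / sqrt(expected) per observed value pair.
    The expectation is the mean over a chain of rewired edge lists."""
    rng = random.Random(seed)
    observed = cooccurrence(edges)
    totals = Counter()
    current = list(edges)
    for _ in range(n_random):
        current = rewire(current, swaps_per_edge * len(current), rng)
        totals.update(cooccurrence(current))
    residuals = {}
    for pair, obs in observed.items():
        expected = totals[pair] / n_random
        if expected > 0:  # pairs never seen in the ensemble are skipped here
            residuals[pair] = (obs - expected) / sqrt(expected)
    return residuals
```

Keeping only the pairs with a positive residual would then give the edge list of the co-occurrence network, with the residual as edge weight.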


The resulting co-occurrence networks, with residuals as edge weights, are subsequently examined
for community structure using the Louvain method [13], [14]. To detect communities at different
granularities, we perform modularity optimization at different resolutions [15], resulting in
multiple partitions with varying numbers of communities. Using these partitions we can replace the
plain co-occurrence weights in the original value-matrices with the probabilities of two values
mutually co-occurring in the same community. The resulting mutual community matrix
(Figures 2 and 3) is then hierarchically clustered, with the resulting tree cut into a preset number
of clusters (Figures 4 and 5). The preset number (28 for nationalities and 24 for occupations)
is based on visual inspection of repeated clusterings.
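
To make the step from multiple partitions to a fixed number of clusters concrete, the following sketch assumes the partitions are already available (for instance from a Louvain implementation run at several resolutions) as dictionaries mapping each value to a community label. The function names are hypothetical, and average linkage on the distance 1 - probability is an assumption, since the text does not name the linkage criterion.

```python
from itertools import combinations

def mutual_community_matrix(partitions):
    """Fraction of partitions in which two values share a community."""
    nodes = sorted(partitions[0])
    shared = {a: {b: 0 for b in nodes} for a in nodes}
    for part in partitions:
        for a in nodes:
            for b in nodes:
                if part[a] == part[b]:
                    shared[a][b] += 1
    prob = {a: {b: shared[a][b] / len(partitions) for b in nodes} for a in nodes}
    return nodes, prob

def cut_into_clusters(nodes, prob, k):
    """Agglomerative clustering with average linkage (assumed), merging the
    closest pair of clusters until exactly k clusters remain."""
    clusters = [[n] for n in nodes]

    def dist(c1, c2):
        total = sum(1.0 - prob[a][b] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Toy input: three partitions of four illustrative labels.
partitions = [
    {"AT": 0, "DE": 0, "US": 1, "CA": 1},
    {"AT": 0, "DE": 0, "US": 1, "CA": 2},
    {"AT": 0, "DE": 1, "US": 2, "CA": 2},
]
nodes, prob = mutual_community_matrix(partitions)
clusters = cut_into_clusters(nodes, prob, k=2)
```

In this toy input, AT and DE share a community in two of three partitions, as do US and CA, so cutting at two clusters groups them accordingly.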


Visualizations of the backbones of the co-occurrence networks (Figures 6 and 7) show the
resulting community structure in context. The network backbones are created by iteratively adding
edges with the largest residuals until the maximal giant connected components (GCC) of the
original networks are restored. For comparison, we plot the occurrence of nationalities and
occupations over time, ordered by their first occurrence while disregarding outliers in terms of
ordering (Figures 8 and 9).
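
The backbone construction can be illustrated with a small union-find sketch. It is an assumption-laden reading of the description above: edges are added in order of decreasing residual, every added edge is kept, and the stopping test compares only the size of the largest component with that of the full network. The names `UnionFind` and `backbone` are hypothetical.

```python
class UnionFind:
    """Disjoint sets that track the size of the largest component."""

    def __init__(self):
        self.parent, self.size, self.largest = {}, {}, 0

    def find(self, x):
        if x not in self.parent:
            self.parent[x], self.size[x] = x, 1
            self.largest = max(self.largest, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.largest = max(self.largest, self.size[ra])

def backbone(edges):
    """edges: list of (u, v, residual). Returns the heaviest edges needed
    until the largest component is as large as in the full network."""
    full = UnionFind()
    for u, v, _ in edges:
        full.union(u, v)
    target = full.largest
    partial, kept = UnionFind(), []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if partial.largest >= target:
            break
        partial.union(u, v)
        kept.append((u, v, w))
    return kept

# Toy network: the GCC {a, b, c, d} is restored after four edges.
toy = [("a", "b", 5.0), ("b", "c", 4.0), ("a", "c", 3.0),
       ("c", "d", 2.0), ("a", "d", 1.0), ("x", "y", 0.5)]
kept = backbone(toy)
```

Because all edges above the stopping threshold are retained, the backbone can contain more edges than a spanning tree of the giant component would.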


Next, the clusters of nationalities and occupations are used to partition Wikipedia biographies
into national community and domain specific subsets. Hyperlinks connecting Wikipedia
articles about individuals are obtained from DBpedia [16] and filtered to approximate contemporary
relationships by excluding links between individuals with birth dates more than 75 years apart.
Using hyperlinks from the English Wikipedia, we visualize the giant connected component of
the partition of individuals connected to occupations in the community of “arts, architecture,
crafts, and design” (Figure 10). Colored by nationality cluster (cf. Figures 2, 4, and 6), the
visualization connects 22,825 nodes with 78,447 edges. We also visualize the giant connected component
of the partition of individuals connected to nationalities in the community of “predominantly
English speaking countries” (Figure 11). Colored by occupational domain (cf. Figures 3, 5, and 7),
the visualization connects 160,913 nodes with 1,004,415 edges. While the arts domain (Figure 10)
seems to reflect the established narrative of art history, where a sequence of nationalities
dominates at different points in time, the predominantly English speaking partition (Figure 11)
is characterized by a more complex structure that precludes the construction of a simple
narrative.
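
The link filter and the extraction of the giant connected component can be sketched as below. This is an illustration only: the function names are hypothetical, links are treated as undirected for the component search, and dropping links whose endpoints lack a known birth year is an assumption not stated in the text.

```python
from collections import defaultdict

def contemporary_links(links, birth_year, max_gap=75):
    """Keep links whose endpoints have birth years at most max_gap apart.
    Links with an endpoint of unknown birth year are dropped (assumption)."""
    return [(a, b) for a, b in links
            if a in birth_year and b in birth_year
            and abs(birth_year[a] - birth_year[b]) <= max_gap]

def giant_component(links):
    """Node set of the largest connected component of an undirected link list."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy data with made-up identifiers and birth years.
birth_year = {"p1": 1471, "p2": 1483, "p3": 1606, "p4": 1632, "p5": 1881}
links = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5"), ("p5", "p6")]
kept_links = contemporary_links(links, birth_year)
```

In the toy data only the links p1-p2 and p3-p4 survive, since the other pairs are born more than 75 years apart or include a person without a birth year.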









3 Conclusion 


In sum, we characterize networks of co-occurring nationalities and occupations related to
Wikidata individuals. Our quantifications indicate that communities of nations derived from
co-occurrence are much more complex than the rather clear-cut communities of occupational
domains. This may be due to substantially more complex social processes leading to co-citizenship,
as we observe in (post)colonial ties; due to the potentially vague concept of citizenship/nationality
itself [3], as found in references to bygone and transient national constructs; or due to the
considerable difference in the amount of available data (93,661 citizenship vs. 585,407 occupation
co-references). Our approach can be used to group synonyms and attributions of differing
granularity, which occur due to the free nature of Wikidata.

By algorithmically mining occupational domains from a large set of individuals, we create an
alternative to manually curated meta-domains of occupation, as used in multiple strains of recent
research. By deriving domain specific groups of individuals directly from a crowd-sourced
ecosystem such as Wikipedia, we also provide a useful alternative (Figure 10) to using expert
curated datasets, such as the Getty Union List of Artist Names [17], as used to analyze the
domain of art history in previous work [9]. Visualizing the Wikipedia hyperlink sub-networks
of such domain specific groups of individuals reveals network patterns that would be obscured
when using the network as a whole.
Figure 1: Data processing pipeline


Table 1: Person links to Nation/Occupation (A) and Nation/Occupation Co-occurrences (B)

                  Nationalities                             Occupations
A                 #persons    #P27 links  #nationalities    #persons    #P106 links  #occupations
Raw data          1,318,484   1,366,777   833               1,363,032   1,706,766    3,419
Reduced data      1,317,676   1,365,600   282               1,352,909   1,685,000    431
1:1 References    1,271,939   1,271,939   282               1,099,593   1,099,593    431
1:n References    45,737      93,661      282               253,316     585,407      430

B                 #co-occurrences         #nodes            #co-occurrences          #nodes
All               2,100                   282               13,846                   430
Positive          1,565                   282               7,641                    430
Backbone          996                     282               2,964                    430
Figure 2: Louvain communities of co-occurring nationalities at different resolutions



Figure 3: Louvain communities of co-occurring occupations at different resolutions 
Figure 4: Hierarchical clustering of communities of co-occurring nationalities

Figure 5: Hierarchical clustering of communities of co-occurring occupations








Figure 6: Network of national overlap through co-occurrence, colored by community 
Figure 7: Network of co-occurring occupations, colored by community

Figure 8: Nationalities over time based on person life-spans

Figure 9: Occupations over time based on person life-spans

Figure 10: Hyperlink network of English Wikipedia biographies having occupations in “arts,
architecture, crafts and design”, colored by nationality community corresponding to the colors
in Figures 2, 4, and 6

Figure 11: Hyperlink network of English Wikipedia biographies with a nationality in the
“predominantly English speaking” community, colored by occupation community corresponding to
the colors in Figures 3, 5, and 7